Take home messages

Precision, imprecision, higher-order probabilities

Higher-order probabilities (HOP) deal better with the problems faced by precise and imprecise probabilism (PP/IP).

Weight of evidence

HOPs allow for a principled, information-theoretic account of weight of evidence, which works better than previous proposals within PP and IP.

Reasoning

The approach is computationally feasible and can be implemented with Bayesian Networks.

Precise probabilism (PP)

A rational agent’s (RA) degrees of belief are to be represented by means of a single probability measure defined over every proposition she entertains

Example: fair coin

\[ \mathsf{P}(H) = \mathsf{P}(\neg H)=.5\]

Example: Unknown bias

\[\mathsf{P}(H) = \mathsf{P}(\neg H)=.5 \]

Imprecision and evidence responsiveness

Locality

RA’s credal stance (in a wide, non-technical sense) about a proposition is to be captured by whatever probability (or probabilities) she assigns to it, and does not depend on what probabilities RA assigns to logically independent propositions

Trouble with evidence responsiveness

PP can’t distinguish between (Fair coin), (Unknown bias), and multiple other such cases

Trouble with sweetening

If RA doesn’t know what the bias of the coin is, learning that it has now increased by .001 might still leave RA undecided

Imprecise probabilism (IP)

Representors

RA’s credal stance towards \(H\) is to be represented by a set \(\mathbb{P}\) of probability measures: those compatible with the evidence

(Bradley, 2019; van Fraassen, 2006; Gärdenfors & Sahlin, 1982; Joyce, 2005; Kaplan, 1968; Keynes, 1921; Levi, 1974; Sturgeon, 2008; Walley, 1991)

Evidence responsiveness

  • Fair coin: as PP

  • Unknown bias: all possible probability measures

Indifference and indecision

  • Indifference: \(A\) and \(B\) are equally likely

  • Indecision: no comparison, no such determination

(Kaplan, 1968)

Probabilistic opinion pooling

Independence preservation

  • If every member agrees that \(X\) and \(Y\) are probabilistically independent, the aggregated credence should respect this

  • fails for linear pooling, and hence for PP-style aggregation (see the sketch after this list)

  • various other limitative results

    (Dietrich & List, 2016)
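
A minimal numerical sketch of this failure (the marginals .2 and .8 are illustrative): two agents each treat \(X\) and \(Y\) as independent, yet the linear pool of their joints does not.

```python
import numpy as np

def product_joint(px, py):
    """Joint over (X, Y) under independence, with P(X=1)=px and P(Y=1)=py."""
    return np.outer([1 - px, px], [1 - py, py])

# Two agents, each taking X and Y to be independent
agent1 = product_joint(0.2, 0.2)
agent2 = product_joint(0.8, 0.8)

# Linear pooling of the joint distributions
pooled = 0.5 * agent1 + 0.5 * agent2

p_xy = pooled[1, 1]                                # pooled P(X=1, Y=1)
p_x, p_y = pooled[1, :].sum(), pooled[:, 1].sum()  # pooled marginals
print(p_xy, p_x * p_y)                             # 0.34 vs 0.25: independence lost
```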

The IP approach to pooling

Bundle them up into a set of measures!

Stewart & Quintana (2018)

Some caveats

Learning and imprecision

RA’s representor should be updated point-wise:

\[ \mathbb{P}_{t_1} = \{\mathsf{P}_{t_1}\vert \exists\, {\mathsf{P}_{t_0} \!\in \mathbb{P}_{t_0}}\,\, \forall\, {H}\,\, \left[\mathsf{P}_{t_1}(H)=\mathsf{P}_{t_0}(H \vert E)\right] \}. \]
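
A toy sketch of point-wise updating (the four-world space, the two measures, and the evidence set are illustrative assumptions): every member of the representor is conditioned on the same evidence.

```python
import numpy as np

worlds = ["w1", "w2", "w3", "w4"]
E = {"w1", "w2"}                      # evidence: one of the first two worlds

# A toy representor: two measures over the four worlds
representor_t0 = [np.array([0.1, 0.2, 0.3, 0.4]),
                  np.array([0.4, 0.3, 0.2, 0.1])]

def conditionalize(P, E):
    """Bayesian conditioning of measure P on evidence E (a set of worlds)."""
    mask = np.array([w in E for w in worlds], dtype=float)
    return P * mask / (P * mask).sum()

# Point-wise updating: condition every member on E
representor_t1 = [conditionalize(P, E) for P in representor_t0]
for P in representor_t1:
    print(P.round(3))    # [0.333 0.667 0. 0.] and [0.571 0.429 0. 0.]
```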

MaxEnt as our guide

Start learning with a distribution that—given the evidence available—is maximally noncommittal with regard to missing information

(Jaynes, 2003; Williamson, 2010)
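
A sketch of the MaxEnt recipe on Jaynes’s dice illustration (mean constraint 4.5; the generic-optimizer formulation is just one way to compute it):

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)

def neg_entropy(p):
    """Negative Shannon entropy (minimizing it maximizes entropy)."""
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log2(p))

# Constraints: p is a distribution and the expected face value is 4.5
constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1},
               {"type": "eq", "fun": lambda p: p @ faces - 4.5}]

res = minimize(neg_entropy, x0=np.ones(6) / 6,
               bounds=[(0, 1)] * 6, constraints=constraints)
print(res.x.round(3))   # tilted towards high faces, otherwise maximally flat
```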

Supervaluationist comparison

RA is more confident of \(A\) than of \(B\) just in case all members of RA’s representor assign higher probability to \(A\) than to \(B\)

Challenges to IP

More evidence responsiveness

Two biases

\(\mathsf{P}_1, \mathsf{P}_2\) such that \(\mathsf{P}_1(H)=.4\) and \(\mathsf{P}_2(H)=.6\)?

Two unbalanced biases (one far more likely than the other)? IP yields the very same representor.

Comparison revisited: Rinard’s mystery urns

  • One urn contains only green marbles

  • No information about the contents of the other (mystery) urn

Intuitions here

  • RA should be certain that the marble drawn from the first urn will be green (\(G\)),

  • RA should be more confident of \(G\) than that the marble drawn from the mystery urn will be green (\(M\))

The trouble for IP

  • For each \(r\in [0,1]\), RA’s representor contains a \(\mathsf{P}\) with \(\mathsf{P}(M)=r\)

  • Including the one with \(\mathsf{P}(M)=1\)

  • So it is not the case that every \(\mathsf{P}\) in RA’s representor satisfies \(\mathsf{P}(G) > \mathsf{P}(M)\)

  • So—on IP—RA does not prefer \(G\) over \(M\)

    (Rinard, 2013)

Belief inertia

Setup and intuition

A coin of completely unknown bias will be tossed repeatedly; intuitively, observing the outcomes should teach RA something about the bias.

Trouble for PP

By (PIE), \(\mathsf{P}_0(H)=.5\) in Stage 0 updates to \(\mathsf{P}_1(H)=.5\) in Stage 1

Belief inertia

Trouble for IP

Point-wise updating of the vacuous representor returns the vacuous representor: for every \(r \in (0,1)\), some member still assigns \(\mathsf{P}(H)=r\) after conditioning, so RA never learns.

(Levi, 1980)

Balance vs. weight

Precursors

  • Beans from a bag, two colors, same observed proportion, different sample sizes (C. S. Peirce, 1872).

  • The notion of weight: balance might remain the same while the amount of relevant evidence shifts (Keynes, 1921).

Desiderata

  • Balance undetermination: different weights with the same balance are possible.

  • Weak (strong) increase: in Bernoulli trials, weight does not decrease (strictly increases) with sample size, keeping the observed frequency fixed.

  • Frequency monotonicity: in Bernoulli trials, keeping sample size fixed, weight does not decrease as the observed frequency moves further from .5.

No unrestricted monotonicity

Weatherson, Joyce, Runde

  • A straight flush has probability \(\frac{40}{2,598,960}\).

  • Then the player starts behaving oddly and bluffing.

Weight and precise probabilism

Hamer’s certainty

Hamer’s measure (the absolute distance of the balance from 0 or 1) fails Balance undetermination: it is a function of the balance alone, so equal balance forces equal weight.

Good’s desiderata and weight

  • \(W(H:E)\) is some function of \(\mathsf{P}(E\vert H), \mathsf{P}(E\vert \neg H)\)

  • \(\mathsf{P}(H \vert E) = g[W(H:E), \mathsf{P}(H)]\)

  • \(W(H: E_1 \wedge E_2) = W(H:E_1) + W(H:E_2 \vert E_1)\)

\[W(H:E) = \log \frac{\mathsf{P}(E \vert H)}{\mathsf{P}(E\vert \neg H)}\]

Good’s weight is not what we’re after

Good’s own example (expanded)

  • A die is selected at random from nine fair dice and one with bias \(\frac{1}{3}\) towards six.

  • The uniform prior gives you an initial weight of evidence for the loaded die of \(\log_{10}(.1)\), that is, \(-1\) (\(-10\) db).

  • Every time you toss it and obtain a six, you gain \(\log_{10}\left(\frac{1/3}{1/6}\right)= \log_{10}(2) \approx .301\).

  • Every time you toss it and obtain something else, the weight changes by \(\log_{10}\left(\frac{2/3}{5/6}\right)= \log_{10}(.8) \approx -.097\).

Good’s weight fails at weak increase: whenever the observed frequency of sixes is below roughly .24, the per-toss contributions sum to a negative number, so at fixed frequency the cumulative weight decreases as the sample grows.
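
A quick numerical check of this failure (a hypothetical pair of samples with the same observed frequency of sixes), using Good’s formula from above:

```python
import numpy as np

def good_weight(sixes, others):
    """Cumulative Good-style weight (in log10 units) for 'the die is loaded',
    given the counts of sixes and non-sixes observed."""
    return sixes * np.log10((1/3) / (1/6)) + others * np.log10((2/3) / (5/6))

# Same observed frequency of sixes (.2), growing sample size:
print(good_weight(1, 4))    # n = 5:  ~ -0.087
print(good_weight(2, 8))    # n = 10: ~ -0.173, so the weight went down
```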

Intervals

Kyburg’s Evidential Probability

\(\mathsf{EP}(H \vert E \wedge K) = [x,y]\)

  • Sharpening by richness (prefer frequencies from full joint distributions)

  • Sharpening by specificity (prefer proper subsets)

  • Sharpening by precision (pick single subinterval if it exists, otherwise, shortest possible cover of minimal subintervals)

Peden’s weight

Let \(\mathsf{EP}(H \vert E \wedge K) = [x,y]\); then:

\(\mathsf{W}(H\vert E\wedge K) = 1 - (y-x)\).

Troubles with EP

  • Peden picks the interval’s edges by error margins (sensitivity to that choice!)

  • Also, sensitivity to what happens around the edges only.

  • How to deploy outside of combinatorial or frequentist contexts?

  • Reasoning with intervals is hard to model sensibly (it does not preserve structural information)

Imprecise probabilities

Challenges to precise probabilism

  • insufficient responsiveness to evidence

  • models any even-handed stance as indifference, and so as sensitive to sweetening (leaving no room for indecision)

  • can’t distinguish between lack of knowledge and knowledge that \(\mathsf{P}(X)=.5\)

  • trouble with aggregation methods (independence preservation etc.)

Representor with pointwise Bayesian learning

\(\mathbb{P}_{t_1} = \{\mathsf{P}_{t_1}\vert \exists\, {\mathsf{P}_{t_0} \!\in \mathbb{P}_{t_0}}\,\, \forall\, {H}\,\, \left[\mathsf{P}_{t_1}(H)=\mathsf{P}_{t_0}(H \vert E)\right] \}\)

Joyce: \(w(X,E) = \sum_x \vert c(ch(X) = x \vert E) \times (x - c(X\vert E))^2 - c(ch(X) = x) \times (x - c(X))^2\vert\)

| hypotheses | .4 | .5 | .6 |
|---|---|---|---|
| credences | 1/3 | 1/3 | 1/3 |
| \(c(X) = \sum_x c(ch(X)=x)\,x\) | .5 | .5 | .5 |
| \(c(E \vert ch(X) =x)\) | .042 | .117 | .214 |
| \(c(E) = \sum_x c(E \vert ch(X) =x)\, c(ch(X)=x)\) | .124 | .124 | .124 |
| \(c(ch(X)=x \vert E)\) | .113 | .312 | .573 |
| \(c(X \vert E) = \sum_x c(ch(X)=x\vert E)\,x\) | .54 | .54 | .54 |
| prior weights | .01 | 0 | .01 |
| posterior weights | .021 | .002 | .002 |
| \(w\) | .0066 | .0066 | .0066 |
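
A sketch of the table’s computation (the likelihoods are taken from the table as given; the last line implements the displayed formula literally, so the final figure may differ from the table’s depending on how the squared deviations are credence-weighted):

```python
import numpy as np

x = np.array([0.4, 0.5, 0.6])            # chance hypotheses for X
prior = np.ones(3) / 3                   # c(ch(X) = x)
like = np.array([0.042, 0.117, 0.214])   # c(E | ch(X) = x), from the table

c_X = prior @ x                          # prior estimate of X: .5
c_E = prior @ like                       # c(E): ~.124
post = prior * like / c_E                # c(ch(X) = x | E): ~[.113, .314, .574]
c_XE = post @ x                          # posterior estimate: ~.546

# Joyce's weight as displayed above: total shift in credence-weighted
# squared deviations of the chance hypotheses from the estimate
w = np.abs(post * (x - c_XE) ** 2 - prior * (x - c_X) ** 2).sum()
print(post.round(3), c_XE.round(3), w.round(4))
```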

Problems with Joyce’s weight

  • Unintuitive behavior around chance hypotheses

  • Failure of weak increase

  • no real use of representors

  • need to use distributions over chance hypotheses

  • taking credence to be the expected value is non-trivial

General problems with IP

Still not evidence sensitive

You still need to go higher-order to model some cases (e.g. uneven bias).

Belief inertia

Point-wise updating can’t make you leave the set of all possible measures.
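
A sketch of the contrast (the Beta(1, 1) second-order prior is an illustrative assumption): the vacuous representor is inert under point-wise updating, while a higher-order prior over the bias learns from the same data.

```python
import numpy as np

# IP: representor members are fixed-bias i.i.d. Bernoulli measures, P_b(H) = b
biases = np.linspace(0.001, 0.999, 999)

# After k heads in n tosses, each member still predicts the next toss with its
# own fixed b: P_b(H | data) = b. The spread of the representor never narrows.
k, n = 90, 100
print(biases.min(), biases.max())   # next-toss probabilities still span (0, 1)

# HOP: a Beta(1, 1) prior over the bias updates to Beta(1 + k, 1 + n - k);
# the predictive probability of heads, (k + 1) / (n + 2), tracks the data
print((k + 1) / (n + 2))            # ~0.892: genuine learning
```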

Unclear mechanism of evidential constraints

  • “Drop measures excluded by the evidence.” But how (other than in degenerate cases)?

  • How exactly is non-testimonial evidence of chances, \(\mathsf{P}(X) = x\) or \(\mathsf{P}(X) \in [x,y]\), supposed to arise?

General problems with IP

Wrong comparative predictions (Rinard)

  • One urn contains only green marbles; no information about the contents of the other (mystery) urn.

  • A marble will be drawn at random from each. You should be certain that the marble drawn from the first urn will be green (\(G\)), and more confident of \(G\) than of \(M\) (that the marble drawn from the mystery urn will be green).

  • IP: for each \(r\in [0,1]\), your representor contains a \(\mathsf{P}\) with \(\mathsf{P}(M)=r\).

  • But then, it also contains one with \(\mathsf{P}(M)=1\).

  • So not for all \(\mathsf{P}\): \(\mathsf{P}(G) > \mathsf{P}(M)\); the preference for \(G\) over \(M\) fails!

General problems with IP

Proper scoring rule is impossible

See results by Seidenfeld 2012, Mayo-Wilson 2016, Schoenfield 2017, Campbell-Moore 2020

Aggregating doesn’t fly far

  • Taking unions leads to skepticism. What else?

  • Can’t model synergy

Second-order approach to uncertainty

Key idea

Often, uncertainty is not a one-dimensional quantity to be mapped onto a single scale such as the real line. It is the whole shape of the distribution over parameter values that should be taken into consideration; summaries are just that: summaries.

Some simple examples
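
A minimal sketch of such examples (the Beta and mixture parameters are illustrative assumptions): three evidential situations that PP collapses into a single \(\mathsf{P}(H)=.5\) come apart in the shapes of their second-order distributions over the bias.

```python
import numpy as np
from scipy import stats

grid = np.linspace(0.001, 0.999, 999)

densities = {
    "fair coin": stats.beta(500, 500).pdf(grid),        # sharply peaked at .5
    "unknown bias": stats.beta(1, 1).pdf(grid),         # flat over [0, 1]
    "two biases": 0.5 * stats.beta(40, 60).pdf(grid)    # bimodal: .4 vs .6
                + 0.5 * stats.beta(60, 40).pdf(grid),
}

for name, dens in densities.items():
    mean = np.trapz(grid * dens, grid) / np.trapz(dens, grid)
    print(f"{name}: first-order summary P(H) = {mean:.2f}")   # all ~0.50
```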

Dealing with problems for IP

  • Much more evidence sensitive (and can go higher order if needed).

  • Seamless integration with Bayesian statistics explains learning.

  • Belief inertia does not arise.

  • HPDI comparison avoids Rinard’s objection.

  • There is a proper scoring rule (Urbaniak, 2022).

  • Evidence-based aggregation is accuracy-wise better than averaging and can model synergy.

Information theory crash-course

  • \(m=8\) possible destinations can be reached by making decisions at \(\log_2(8)=3\) forks.

  • Surprise: \(1/\mathsf{P}(x)\)

  • Shannon information: \(\log_2(\mathsf{surprise}) = \log_2 \frac{1}{\mathsf{P}(x)} = -\log_2 \mathsf{P}(x)\)

  • Entropy is average Shannon information:

\[H(X) = \sum \mathsf{P}(x_i) \log_2 \frac{1}{\mathsf{P}(x_i)} = - \sum \mathsf{P}(x_i) \log_2 \mathsf{P}(x_i)\]

(the expected amount of information you receive once you learn what the value of \(X\) is).

Cross-entropy and KLD

Say events arise according to a distribution \(\mathsf{P}\) but we predict them using a distribution \(\mathsf{Q}\).

Cross-entropy

\[\mathsf{H}(\mathsf{P}, \mathsf{Q}) = -\sum \mathsf{P}_i \log_2(\mathsf{Q}_i)\]

Kullback-Leibler divergence

\[\mathsf{KLD}(\mathsf{P}, \mathsf{Q}) = H(\mathsf{P}, \mathsf{Q}) - H(\mathsf{P})\\ = - \sum \mathsf{P}_i \log_2(\mathsf{Q}_i) - \left( - \sum \mathsf{P}_i \log_2 \mathsf{P}_i\right) \\ = - \sum \mathsf{P}_i\left( \log_2 \mathsf{Q}_i - \log_2\mathsf{P}_i\right)\\ = \sum \mathsf{P}_i\left( \log_2 \mathsf{P}_i - \log_2\mathsf{Q}_i\right)\\ = \sum \mathsf{P}_i \log_2 \left( \frac{\mathsf{P}_i}{\mathsf{Q_i}}\right) \]
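
The definitions and the identity above, spelled out in a few lines (the two distributions are illustrative):

```python
import numpy as np

def entropy(p):
    """Average Shannon information of P, in bits."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Expected surprise when events follow P but are predicted with Q."""
    return -np.sum(p * np.log2(q))

def kld(p, q):
    return np.sum(p * np.log2(p / q))

p = np.array([0.5, 0.25, 0.25])   # the true distribution
q = np.array([1/3, 1/3, 1/3])     # the predicting distribution

# KLD(P, Q) = H(P, Q) - H(P), and it vanishes when Q = P
print(kld(p, q), cross_entropy(p, q) - entropy(p))   # both ~0.085
print(kld(p, p))                                     # 0.0
```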

Weight of a distribution

Key idea

The more informative a piece of evidence is, as compared to the uniform distribution, the more weight it has, on a scale from 0 to 1:

\[\mathsf{w}(\mathsf{P}) = 1 - \frac{H(\mathsf{P})}{H(\mathsf{uniform})}\]

Weight of a distribution

Weak increase holds
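
A sketch of the measure at work (assuming Beta posteriors over a discretized bias parameter): holding the observed frequency fixed at .6, weight grows with sample size, as Weak increase requires.

```python
import numpy as np
from scipy import stats

grid = np.linspace(0.001, 0.999, 999)

def weight(density):
    """1 - H(P)/H(uniform) for a density discretized over the grid."""
    p = density / density.sum()
    p = p[p > 0]
    h = -np.sum(p * np.log2(p))
    return 1 - h / np.log2(len(grid))

# Beta posteriors after observing 60% heads in samples of growing size
for n in [10, 100, 1000]:
    k = int(0.6 * n)
    dens = stats.beta(1 + k, 1 + n - k).pdf(grid)
    print(n, round(weight(dens), 3))   # the weight grows with the sample
```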

Works for a variety of shapes

Abuse and rocking example

Weights in BNs
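
The talk’s BN illustrations are not reproduced here; as a minimal sketch of the idea (assuming the pgmpy library, and applying the distribution-weight measure above to a node’s posterior marginal):

```python
import numpy as np
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# A toy network: Bias -> Toss, with three discretized chance hypotheses
model = BayesianNetwork([("Bias", "Toss")])
model.add_cpds(
    TabularCPD("Bias", 3, [[1/3], [1/3], [1/3]]),
    TabularCPD("Toss", 2, [[0.6, 0.5, 0.4],    # P(tails | Bias)
                           [0.4, 0.5, 0.6]],   # P(heads | Bias)
               evidence=["Bias"], evidence_card=[3]),
)

def weight(p):
    """1 - H(P)/H(uniform) applied to a discrete marginal."""
    return 1 + np.sum(p * np.log2(p)) / np.log2(len(p))

# Posterior over the chance node after observing heads
posterior = VariableElimination(model).query(["Bias"], evidence={"Toss": 1}).values
print(weight(np.ones(3) / 3), weight(posterior))   # 0.0 -> ~0.012
```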

Expected weight

Wrapping up

The higher-order approach

  • Leads to more honesty in uncertainty assessment
  • Is more sensible than sensitivity analysis
  • Integrates with Bayesian data analysis
  • Leads to an information-theoretic account of evidential weight
  • Is computationally feasible

Other things I wish I had time to discuss

  • connections with precise vs. imprecise probabilism in formal epistemology

  • problems with existing opinion aggregation methods and a higher-order approach

  • modeling synergy of multiple sources of information

  • further properties of weight

  • relation to evidential completeness

Thank you!

Bradley, S. (2019). Imprecise probabilities. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy (Spring 2019). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/spr2019/entries/imprecise-probabilities/

Dietrich, F., & List, C. (2016). Probabilistic opinion pooling. In A. Hájek & C. Hitchcock (Eds.), The Oxford handbook of probability and philosophy. Oxford: Oxford University Press.

Elkin, L., & Wheeler, G. (2018). Resolving peer disagreements through imprecise probabilities. Noûs, 52(2), 260–278. https://doi.org/10.1111/nous.12143

van Fraassen, B. C. (2006). Vague expectation value loss. Philosophical Studies, 127(3), 483–491. https://doi.org/10.1007/s11098-004-7821-2

Gärdenfors, P., & Sahlin, N.-E. (1982). Unreliable probabilities, risk taking, and decision making. Synthese, 53(3), 361–386. https://doi.org/10.1007/bf00486156

Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge: Cambridge University Press.

Joyce, J. M. (2005). How probabilities reflect evidence. Philosophical Perspectives, 19(1), 153–178.

Kaplan, J. (1968). Decision theory and the fact-finding process. Stanford Law Review, 20(6), 1065–1092.

Keynes, J. M. (1921). A treatise on probability. London: Macmillan.

Levi, I. (1974). On indeterminate probabilities. The Journal of Philosophy, 71(13), 391. https://doi.org/10.2307/2025161

Levi, I. (1980). The enterprise of knowledge: An essay on knowledge, credal probability, and chance. MIT Press.

Rinard, S. (2013). Against radical credal imprecision. Thought: A Journal of Philosophy, 2(1), 157–165. https://doi.org/10.1002/tht3.84

Stewart, R. T., & Quintana, I. O. (2018). Learning and pooling, pooling and learning. Erkenntnis, 83(3), 1–21. https://doi.org/10.1007/s10670-017-9894-2

Sturgeon, S. (2008). Reason and the grain of belief. Noûs, 42(1), 139–165. http://www.jstor.org/stable/25177157

Walley, P. (1991). Statistical reasoning with imprecise probabilities. London: Chapman & Hall.

Williamson, J. (2010). In defence of objective Bayesianism. Oxford: Oxford University Press.